Lab 5: Intro to Machine Learning

Practice session

Luisa M. Mimmi —   https://luisamimmi.org/

February 17, 2025

GOAL OF TODAY’S PRACTICE SESSION

Lab # 5

  • In this Lab session, we will focus on Machine Learning (ML), as introduced in Lecture 5
  • We will review examples of both supervised and unsupervised ML algorithms
    • Supervised ML algorithms examples:
      • Logistic regression
      • Classification and regression trees (CART)
      • 🌳 Random Forest classifier
    • Unsupervised ML algorithms examples:
      • K-means Clustering
      • PCA for dimension reduction
    • (optional) PLS-DA for classification, a supervised ML alternative to PCA

🟠 ACKNOWLEDGEMENTS

The examples and datasets in this Lab session closely follow two sources:

  1. The tutorial on “Data Analytics with R” by: Brian Machut, Nathan Cornwell

  2. The tutorial on “Principal Component Analysis (PCA) in R” by: Statistics Globe

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We may also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session

# --- General 
library(here)     # tools find your project's files, based on working directory
library(dplyr)    # A Grammar of Data Manipulation
library(skimr)    # Compact and Flexible Summaries of Data
library(magrittr) # A Forward-Pipe Operator for R 
library(readr)    # Read Rectangular Text Data
library(tidyr)    # Tidy Messy Data
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax

# ---Plotting & data visualization
library(ggplot2)      # Create Elegant Data Visualisations Using the Grammar of Graphics
library(ggfortify)     # Data Visualization Tools for Statistical Analysis Results
library(scatterplot3d) # 3D Scatter Plot

# --- Statistics
library(MASS)       # Support Functions and Datasets for Venables and Ripley's MASS
library(factoextra) # Extract and Visualize the Results of Multivariate Data Analyses
library(FactoMineR) # Multivariate Exploratory Data Analysis and Data Mining
library(rstatix)    # Pipe-Friendly Framework for Basic Statistical Tests
library(car)        # Companion to Applied Regression
library(ROCR)       # Visualizing the Performance of Scoring Classifiers

# --- Tidymodels (meta package)
library(rsample)    # General Resampling Infrastructure  
library(broom)      # Convert Statistical Objects into Tidy Tibbles

LOGISTIC REGRESSION

(EXAMPLE of SUPERVISED ML ALGORITHM)

Logistic regr.: review

  • Logistic regression is a type of Generalized Linear Model (GLM): a more flexible version of linear regression that can work also for categorical response variables or count data (e.g. poisson regression).

  • Logistic regression is used to model a binary response variable: e.g. yes|no (or benign|malignant in the biopsy dataset).

  • Logistic regression fits the coefficients \(\beta_0\), \(\beta_1\), …, \(\beta_k\) to the data using the Maximum Likelihood Estimation method (unlike linear regression, which uses the Least Squares Estimation method).

Logistic regr.: logit function

If we have predictor variables like \(x_{1,i}\), \(x_{2,i}\), …, \(x_{k,i}\) and a binary response variable \(y_i\) (where \(y_i = 0\) or \(y_i = 1\)), we need a “special” function (or "link" function) able to transform the expected value of the response variable into the \([0,1]\) outcome we’re trying to predict.

If \(p_i\) is the probability that \(y_i = 1\), the logit (link) function is defined as:

\[ logit(p_i) = \ln\left( \frac{p_i}{1-p_i} \right) \]

…and the logistic regression model becomes:

\[ logit (p_i) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i} \]
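As a quick sanity check, base R's stats package already implements the logit and its inverse (qlogis and plogis), which we can use to verify the link function numerically:

```r
# logit (log-odds) and its inverse, the logistic function, in base R
p <- 0.8
qlogis(p)          # log(0.8 / 0.2) = 1.386294
plogis(qlogis(p))  # back to 0.8
plogis(0)          # log-odds of 0 corresponds to p = 0.5
```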

DATASETS for today


In this tutorial, we will use:

Dataset on Heart Disease

Name: heart_data.csv
Documentation: Toy dataset prepared for teaching purposes. See the reference on the data in the Data Analytics with R tutorial.
Sampling details: This dataset contains 10,000 observations on 4 variables.

# Use `here` in specifying all the subfolders AFTER the working directory 
heart_data <- read.csv(file = here::here("practice", "data_input", "05_datasets",
                                      "heart_data.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

heart_data variables with description

Variable Type Description
heart_disease int whether an individual has heart disease (1 = yes; 0 = no)
coffee_drinker int whether an individual drinks coffee regularly (1 = yes; 0 = no)
fast_food_spend dbl a numerical field corresponding to the annual spend of each individual on fast food
income dbl a numerical field corresponding to the individual’s annual income

heart dataset splitting

In Machine Learning, it is good practice to split the data into training and testing sets.

  • We will use the training set (70%) to fit the model and then the testing set (30%) to evaluate the model’s performance.
set.seed(123)

# Obtain 2 sub-samples from the dataset: training and testing
sample  <-  sample(c(TRUE, FALSE), nrow(heart_data), replace = TRUE , prob = c(0.7, 0.3) )
heart_train  <-  heart_data[sample,]
heart_test  <-  heart_data[!sample,]
  • Which results in:
# check the structure of the resulting datasets
dim(heart_train)
[1] 7048    4
dim(heart_test)
[1] 2952    4

Convert binary variables to factors

Before examining the training dataset heart_train, we convert the binary variables heart_disease and coffee_drinker to factors (for better readability).

heart_train <- heart_train %>% 
  # convert to factor with levels "Yes" and "No"
  mutate(heart_disease = factor(heart_disease, levels = c(0, 1),
                                labels = c("No_HD", "Yes_HD")),
         coffee_drinker = factor(coffee_drinker, levels = c(0, 1),
                                 labels = c("No_Coffee", "Yes_Coffee")) 
  )

# show the first 5 rows of the dataset
heart_train[1:5,]
  heart_disease coffee_drinker fast_food_spend    income
1         No_HD      No_Coffee        1823.816 44361.625
3         No_HD      No_Coffee        2683.873 31767.139
6         No_HD     Yes_Coffee        2298.971  7491.559
7         No_HD      No_Coffee        2063.783 24905.227
9         No_HD      No_Coffee        2902.645 37468.529

Plotting Y by X1 (continuous variable)

We now visualize the relationship between the binary outcome variable heart_disease and the continuous predictor variable fast_food_spend.

# plot the distribution of heart disease status by fast food spend
heart_train %>% 
  ggplot(aes(x = heart_disease, y = fast_food_spend, fill = heart_disease)) + 
  geom_jitter(aes(fill = heart_disease), alpha = 0.3, shape = 21, width = 0.25) +  
  scale_color_manual(values = c("#005ca1", "#9b2339")) + 
  scale_fill_manual(values = c("#57b7ff", "#e07689")) + 
  geom_boxplot(fill = NA, color = "black", linewidth = .7) + 
  coord_flip() +
    theme(plot.title = element_text(size = 13,face="bold", color = "#873c4a"),
        axis.text.x = element_text(size=12,face="italic"), 
        axis.text.y = element_text(size=12,face="italic"),
        legend.position = "none") + 
  labs(title = "Fast food expenditure by heart disease status") + 
  xlab("Heart Disease (Y/N)") + 
  ylab("Annual Fast Food Spend") 

Plotting Y by X1 (continuous variable)

The boxplots indicate that subjects who spend higher amounts annually on fast food show higher likelihood of heart disease (HD =1)

Plotting Y by X2 (discrete variable)

Then we examine the relationship between the binary outcome variable heart_disease and the binary predictor variable coffee_drinker.

  • we use the handy count function from dplyr to count occurrences in categorical variable(s) combinations.
# Dataset manipulation
heart_train %>% 
  # count the unique values per each group from 2 categorical variables' combinations
  dplyr::count(heart_disease, coffee_drinker, name = "count_by_group") %>% 
  dplyr::group_by(coffee_drinker) %>% 
  dplyr::mutate(
    total_coffee_class = sum(count_by_group),
    proportion = count_by_group / total_coffee_class) %>% 
  dplyr::ungroup() %>% 
  # filter only those with heart disease
  dplyr::filter(heart_disease == "Yes_HD") %>% 
# Plot  
  ggplot(aes(x = coffee_drinker, y = proportion, fill = coffee_drinker)) + 
  geom_bar(stat = "identity") + 
  scale_fill_manual(values = c("#57b7ff", "#e07689")) + 
  theme_minimal() +
  ylab("Percent with Heart Disease") +
  xlab("Coffee Drinker (Y/N)") +
  ggtitle("Figure 3: Percent of Coffee Drinkers with Heart Disease") +
  labs(fill = "Coffee Drinker") + 
  scale_y_continuous(labels = scales::percent)

Plotting Y by X2 (discrete variable)

Drinking coffee also seems associated with a higher likelihood of heart disease (HD = 1)

Linear regression wouldn’t work!

We could technically use linear regression (Figure 1) to study the likelihood of having Heart Disease in relation to risk factors, but it would not work:

  • 🤔 How should we interpret a continuous prediction when it is different from \(0\) or \(1\)?
# Hypothetical linear regression model
lin_reg_model <- lm(heart_disease ~ fast_food_spend + coffee_drinker, data = heart_data)
# Plot
ggplot(heart_data) +
  geom_point(aes(fast_food_spend, heart_disease, color = factor(heart_disease)), 
             alpha = 0.25, size = 1.5) + 
  scale_color_manual(values = c("0" = "#005ca1", "1" = "#9b2339")) +
  # regression line
  geom_abline(slope = coef(lin_reg_model)[2], 
              intercept = coef(lin_reg_model)[1], color = "black", linewidth = 1.2, linetype = "dashed" ) + 
  labs(title = "Heart disease [0,1] against fast food spend (US$) + regression line (black dashed)") + 
  ylab("Heart Disease (Y/N)") + 
  xlab("Annual Fast Food Spend") 

Linear regression wouldn’t work!

Figure 1: Showing that a linear regression (black dashed line) would fit poorly

Linear regression wouldn’t work!

  • Also, recall that linear regression models imply certain assumptions:
    • Linear relationship between Y and the predictors X1 and X2
    • Residuals must be 1) approximately normally distributed and 2) uncorrelated
    • Homoscedasticity: residuals should have constant variance
    • Non collinearity: predictor variables should not be highly correlated with each other



  • Assumptions are not met in this case
# diagnostic plots
par(mfrow=c(2,2))
plot(lin_reg_model)

Linear regression wouldn’t work!

Figure 2: Diagnostic plots for a hypothetical linear regression model

Fitting a logistic regression model

  • We can instead fit a logistic regression model to the heart_train data using:
    • the glm function for Generalized Linear Models,
      • with argument family = binomial to specify logistic regression which will use logit as the link function,
      • specifying (as usual) the 3 predictor variables after ~.
# Fit a logistic regression model
heart_model <- glm(heart_disease ~ coffee_drinker + fast_food_spend + income,
                   data = heart_train, 
                   family = binomial(link = "logit"))

Model summary and coefficients

  • Table 1 (next) shows the model output, with the coefficient estimate for each predictor.
    • The broom::tidy function converts the model summary into a more readable data frame.
    • The odds ratio (= exponentiated coefficient estimate) is more interpretable than the coefficient itself.
# Convert model's output summary into data frame
heart_model_coef <- broom::tidy(heart_model) %>% 
  # improve readability of significance levels
  dplyr::mutate('signif. lev.' = case_when(
    `p.value` < 0.001 ~ "***",
    `p.value` < 0.01 ~ "**",
    `p.value` < 0.05 ~ "*",
    TRUE ~ ""))%>%
  # add odds ratio column
  dplyr::mutate(odds_ratio = exp(estimate)) %>%
  dplyr::relocate(odds_ratio, .after = estimate) %>%
  dplyr::mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  # format as table
  knitr::kable() %>% 
  # reduce font size
  kable_styling(font_size = 20) %>% 
  # add table title
  kableExtra::add_header_above(c("Logistic Regression Analysis of Heart Disease Risk Factors"= 7))

heart_model_coef

Model summary and coefficients

Table 1: Complete logistic regression model

+ estimates of coefficients are in the form of the natural logarithm of the odds of the event happening (Heart Disease)
+ a positive estimate indicates an increase in the odds of having Heart Disease
+ a negative estimate indicates a decrease in the odds of having Heart Disease
+ odds_ratio = the exponentiated coefficient estimate

Logistic Regression Analysis of Heart Disease Risk Factors
term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.6040 -18.3051 0.0000 ***
coffee_drinkerYes_Coffee -0.7296 0.4821 0.2910 -2.5071 0.0122 *
fast_food_spend 0.0024 1.0024 0.0001 20.6018 0.0000 ***
income 0.0000 1.0000 0.0000 -0.2299 0.8182

Interpreting the logistic coefficients

term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.6040 -18.3051 0.0000 ***
coffee_drinkerYes_Coffee -0.7296 0.4821 0.2910 -2.5071 0.0122 *
fast_food_spend 0.0024 1.0024 0.0001 20.6018 0.0000 ***
income 0.0000 1.0000 0.0000 -0.2299 0.8182
  • Intercept: gives the log-odds of heart disease when all predictor variables are zero. While not our focus, the highly negative value suggests very low baseline probability of heart disease in the sample.
  • Income: Based on a p-value = 0.8182, we conclude that income is not significantly associated with heart disease. (Anyhow, the odds ratio of ≈1 would suggest no change in odds based on income.)

Interpreting the logistic coefficients

term estimate odds_ratio std.error statistic p.value signif. lev.
(Intercept) -11.0554 0.0000 0.6040 -18.3051 0.0000 ***
coffee_drinkerYes_Coffee -0.7296 0.4821 0.2910 -2.5071 0.0122 *
fast_food_spend 0.0024 1.0024 0.0001 20.6018 0.0000 ***
income 0.0000 1.0000 0.0000 -0.2299 0.8182
  • Coffee Drinker (=YES): the negative coefficient of -0.7296 (log-odds) means that coffee drinking is associated with lower odds of having heart disease. Specifically, it multiplies the odds of heart disease by \(e^{-0.7296}\), i.e. an \(OR \approx\) 0.48 (a 52% reduction in odds relative to non-coffee drinkers), holding all other predictors fixed. This effect is statistically significant as p-value = 0.0122.
  • Fast Food Spend: the positive coefficient of 0.0024 (log-odds) means that fast food spend is associated with increased odds of having heart disease. Given this is a continuous variable, this means that for every 1-unit increase in fast food spending, the odds of heart disease increase by (e^0.0024 - 1) \(=\) (1.0024 - 1), i.e. by 0.24%. This effect is highly significant as p-value = 0.0000.
    • But we should consider a more adequate scale! (relative to fast food annual spending):

      • Δ+ = 1 USD: \((1.0024)^{1} \approx 1.0024 \quad \text{hence (1.0024 - 1)x100 = 0.24% increase in odds}\)

      • Δ+ = 10 USD: \((1.0024)^{10} \approx 1.024 \quad \text{hence (1.024 - 1)x100 = 2.4% increase in odds}\)

      • Δ+ = 100 USD: \((1.0024)^{100} \approx 1.27 \quad \text{hence (1.27 - 1)x100 = 27% increase in odds}\)
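These odds-ratio scalings can be reproduced directly from the fast_food_spend coefficient in Table 1:

```r
# Odds multipliers implied by the fast_food_spend coefficient (Table 1)
beta_ffs <- 0.0024                 # rounded estimate from the model output
exp(beta_ffs * c(1, 10, 100))      # +1, +10, +100 USD of annual spending
# ≈ 1.0024, 1.0243, 1.2712  →  +0.24%, +2.4%, +27% increase in odds
```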

Wait, is drinking coffee good or bad? 🤔

Our plot above showed that there was a higher proportion of coffee drinkers with heart disease as compared to non coffee drinkers. However, our model told us that coffee drinking is associated with a decrease in the likelihood of having heart disease.
How can that be?

It’s because coffee drinking and fast food spend are correlated: so, on its own, it would appear as if coffee drinking were associated with heart disease, but this is only because coffee drinking is also associated with fast food spend, which our model tells us is the real contributor to heart disease.


Making predictions from logistic regression model

# Make predictions
heart_train$heart_disease_pred <- predict(heart_model, type = "response")

Converting predictions into classifications

# Convert predictions to classifications
heart_train$heart_disease_pred_class <- ifelse(heart_train$heart_disease_pred > 0.5, 1, 0)

Evaluating the model

ROC curve

# Load the pROC package
library(pROC)

# Create a ROC curve
roc_curve <- roc(heart_train$heart_disease, heart_train$heart_disease_pred)

# Plot the ROC curve
plot(roc_curve, col = "blue", main = "ROC Curve", legacy.axes = TRUE)

Confusion matrix

# Confusion matrix
confusion_matrix <- table(heart_train$heart_disease, heart_train$heart_disease_pred_class)

confusion_matrix
        
            0    1
  No_HD  6790   30
  Yes_HD  148   80

Sensitivity and Specificity

# Sensitivity
sensitivity <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])

# Specificity
specificity <- confusion_matrix[1, 1] / sum(confusion_matrix[1, ])
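Plugging in the counts printed in the confusion matrix above gives the actual rates (note the low sensitivity at the default 0.5 cutoff):

```r
# Counts from the confusion matrix above (rows = actual, columns = predicted)
#            pred 0  pred 1
#  No_HD       6790      30
#  Yes_HD       148      80
sensitivity <- 80 / (148 + 80)     # TP / (TP + FN) ≈ 0.351
specificity <- 6790 / (6790 + 30)  # TN / (TN + FP) ≈ 0.996
```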

🔴🟠🟡🟢🔵🟣

Dataset on Breast Cancer Biopsy

Name: Biopsy Data on Breast Cancer Patients
Documentation: See reference on the data downloaded and conditioned for R here https://cran.r-project.org/web/packages/MASS/MASS.pdf
Sampling details: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. The dataset contains the original Wisconsin breast cancer data with 699 observations on 11 variables.

Importing Dataset biopsy

  • The data can be interactively obtained from the MASS R package
# (after loading pckg)
# library(MASS)  

# I can call 
utils::data(biopsy)

biopsy variables with description

Variable Type Description
ID character Sample id
V1 integer 1 - 10 clump thickness
V2 integer 1 - 10 uniformity of cell size
V3 integer 1 - 10 uniformity of cell shape
V4 integer 1 - 10 marginal adhesion
V5 integer 1 - 10 single epithelial cell size
V6 integer 1 - 10 bare nuclei (16 values are missing)
V7 integer 1 - 10 bland chromatin
V8 integer 1 - 10 normal nucleoli
V9 integer 1 - 10 mitoses
class factor benign or malignant

biopsy variables exploration

The biopsy data contains 699 observations of 9 explanatory variables, V1, V2, …, V9, each scored on a scale of 1 to 10.

The dataset also contains a character variable: id, and a factor variable: class, with two levels (“benign” and “malignant”).

# check variable types
str(biopsy)
'data.frame':   699 obs. of  11 variables:
 $ ID   : chr  "1000025" "1002945" "1015425" "1016277" ...
 $ V1   : int  5 5 3 6 4 8 1 2 2 4 ...
 $ V2   : int  1 4 1 8 1 10 1 1 1 2 ...
 $ V3   : int  1 4 1 8 1 10 1 2 1 1 ...
 $ V4   : int  1 5 1 1 3 8 1 1 1 1 ...
 $ V5   : int  2 7 2 3 2 7 2 2 2 2 ...
 $ V6   : int  1 10 2 4 1 10 10 1 1 1 ...
 $ V7   : int  3 3 3 3 3 9 3 3 1 2 ...
 $ V8   : int  1 2 1 7 1 7 1 1 1 1 ...
 $ V9   : int  1 1 1 1 1 1 1 1 5 1 ...
 $ class: Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

biopsy missing data

There is one incomplete variable V6 = “bare nuclei” with 16 missing values.

  • remember the package skimr for exploring a dataframe?
# check if vars have missing values
biopsy %>% 
  # select only variables starting with "V"
  skimr::skim(starts_with("V")) %>%
  dplyr::select(skim_variable, 
                n_missing)
# A tibble: 9 × 2
  skim_variable n_missing
  <chr>             <int>
1 V1                    0
2 V2                    0
3 V3                    0
4 V4                    0
5 V5                    0
6 V6                   16
7 V7                    0
8 V8                    0
9 V9                    0

biopsy missing data options

We can decide what to do in these cases (informed by our knowledge of the dataset):

  • Option 1) We drop the observations with incomplete data (i.e. the 16 rows with missing values for V6 = “bare nuclei”).
# remove rows with missing values
biopsy_drop <- biopsy %>% 
  dplyr::filter(!is.na(V6))

mean(biopsy_drop$V6)
[1] 3.544656
  • Option 2) We impute the missing values with the median of the variable V6 = “bare nuclei”.
# impute missing values with the median of the variable
biopsy_impute <- biopsy %>% 
  dplyr::mutate(V6 = ifelse(is.na(V6), median(V6, na.rm = TRUE), V6))

mean(biopsy_impute$V6)
[1] 3.486409

biopsy dataset exploration

  • Biopsied cells of 699 breast cancer tumors, used to determine whether the tumors were benign or malignant.
  • This determination was based on 9 characteristics of the cells, ranked from 1 (benign) to 10 (malignant):
    • 1) Clump Thickness – How the cells aggregate. If monolayered they are benign and if clumped on top of each other they are malignant
    • 2) Uniform Size – All cells of the same type should be the same size.
    • 3) Uniform Shape – If cells vary in cell shape they could be malignant
    • 4) Marginal Adhesion – Healthy cells have a strong ability to stick together whereas cancerous cells do not
    • 5) Single Epithelial Size – If epithelial cells are not equal in size, it could be a sign of cancer
    • 6) Bare nuclei – If the nucleus of the cell is not surrounded by cytoplasm, the cell could be malignant
    • 7) Bland Chromatin – If the chromatin’s texture is coarse the cell could be malignant
    • 8) Normal Nucleoli – In a healthy cell the nucleoli are small and hard to detect via imagery. Enlarged nucleoli could be a sign of cancer
    • 9) Mitosis – cells that multiply at an uncontrollable rate could be malignant

biopsy dataset preparation

  • The explanatory variable(s) (clump_thickness, …, mitosis) can be renamed for better readability
  • The observations with missing values (bare_nuclei) are removed for simplicity
  • Patient ID (id) can be dropped as it is not used in this analysis
# Create a clean version of the dataset
biopsy_clean <- biopsy %>%
  # rename the columns (new = old)
  rename(
    id                = ID,
    clump_thickness   =  V1,
    uniform_size      =  V2,
    uniform_shape     =  V3,
    marginal_adhesion =  V4,
    single_epith_size =  V5,
    bare_nuclei       =  V6,
    bland_chromatin   =  V7,
    normal_nuclei     =  V8,
    mitosis           =  V9,
    class             =  class) %>% 
  # remove rows with missing values
  na.omit() %>% 
  # remove the id column
  dplyr::select(-id)

# check the structure of the dataset
paint::paint(biopsy_clean)
data.frame [683, 10]
clump_thickness   int 5 5 3 6 4 8
uniform_size      int 1 4 1 8 1 10
uniform_shape     int 1 4 1 8 1 10
marginal_adhesion int 1 5 1 1 3 8
single_epith_size int 2 7 2 3 2 7
bare_nuclei       int 1 10 2 4 1 10
bland_chromatin   int 3 3 3 3 3 9
normal_nuclei     int 1 2 1 7 1 7
mitosis           int 1 1 1 1 1 1
class             fct benign benign benign benign benign ma~

biopsy sample splitting

In Machine Learning, it is good practice to split the data into training and testing sets.

  • We will use the training set (80%) to fit the model and then the testing set (20%) to evaluate the model’s performance.
set.seed(123)

# Obtain 2 sub-samples from the dataset: training and testing
sample  <-  sample(c(TRUE, FALSE), nrow(biopsy_clean), replace = TRUE , prob = c(0.8, 0.2) )
biopsy_train  <-  biopsy_clean[sample,]
biopsy_test  <-  biopsy_clean[!sample,]
  • Which results in:
# check the structure of the resulting datasets
dim(biopsy_train)
[1] 542  10
dim(biopsy_test)
[1] 141  10

Independent variables’ visualization

  1. We create a new df biopsy_train2 (with only 3 columns)
  2. Then, in Figure 3, we visualize the distribution of the explanatory variables, where each is plotted between the two classes of the tumor.
# New df for plotting
biopsy_train2 <- data.frame(
  "level" = c(biopsy_train$clump_thickness, biopsy_train$uniform_size,
              biopsy_train$uniform_shape, biopsy_train$marginal_adhesion,
              biopsy_train$single_epith_size, biopsy_train$bare_nuclei,
              biopsy_train$bland_chromatin, biopsy_train$normal_nuclei,
              biopsy_train$mitosis),
  # repeat each label once per training row, so labels align with the stacked columns
  "type" = rep(c("Clump Thickness", "Uniform Size", "Uniform Shape",
                 "Marginal Adhesion", "Single Epithelial Size", "Bare Nuclei",
                 "Bland Chromatin", "Normal Nuclei", "Mitosis"),
               each = nrow(biopsy_train)), 
  "class" = c(biopsy_train$class))

# Plot
ggplot(biopsy_train2, aes(x = level, y = class , colour = class)) + 
  geom_boxplot(fill = NA) +
  scale_color_manual(values = c("#005ca1", "#9b2339")) + 
  geom_jitter(aes(fill = class), alpha = 0.25, shape = 21, width = 0.2) +  
  scale_fill_manual(values = c("#57b7ff", "#e07689")) +  
  facet_wrap(~type, scales = "free") +  
  theme(plot.title = element_text(size = 13,face="bold", color = "#873c4a"),
        axis.text.x = element_text(size=12,face="italic"), 
        axis.text.y = element_text(size=12,face="italic"),
        legend.position = "none") + 
  labs(title = "Distribution of each explanatory variable by tumor class (benign/malignant) in samples") + 
  ylab(label = "") + xlab(label = "")

Independent variables’ visualization

Figure 3: Boxplot of the independent variables
- values are consistently higher & more dispersed for malignant tumors
- values between 1 and 2 are classified as benign and values greater than 2 are classified as malignant

Logistic regr.: model fitting

  • We fit a logistic regression model to the biopsy_train data using:
    • the glm function with argument family = binomial to specify the logistic regression model;
    • and with class ~ . to specify an initial model that uses all the variables as predictors (backward elimination approach).
# Building initial model 
model = stats::glm(class ~ . , family = binomial, data=biopsy_train)
  • Table 2 shows the model summary, with the coefficient estimate for each predictor.
    • the broom::tidy function converts the model summary into a data frame.
broom::tidy(model) %>% 
  mutate('signif. lev.' = case_when(
    `p.value` < 0.001 ~ "***",
    `p.value` < 0.01 ~ "**",
    `p.value` < 0.05 ~ "*",
    TRUE ~ ""))%>%
  mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  knitr::kable() 

Logistic regr.: model coefficients

Table 2: Complete logistic regression model
+ coefficients are in the form of natural logarithm of the odds of the event happening
+ positive estimate indicates an increase in the odds of finding a malignant tumor
term estimate std.error statistic p.value signif. lev.
(Intercept) -9.4169 1.1637 -8.0921 0.0000 ***
clump_thickness 0.4984 0.1434 3.4758 0.0005 ***
uniform_size 0.0992 0.2356 0.4211 0.6737
uniform_shape 0.2809 0.2655 1.0580 0.2901
marginal_adhesion 0.2688 0.1285 2.0917 0.0365 *
single_epith_size 0.0800 0.1679 0.4765 0.6337
bare_nuclei 0.3446 0.0983 3.5054 0.0005 ***
bland_chromatin 0.4083 0.1859 2.1958 0.0281 *
normal_nuclei 0.2196 0.1283 1.7119 0.0869
mitosis 0.4356 0.3513 1.2397 0.2151

Logistic regr.: coefficients’ interpretation

In logistic regression, the coefficients are in the form of the natural logarithm of the odds of the response event happening (i.e. \(Y_i = 1\)):

\[logit(p_i) = \ln\left( \frac{p_i}{1-p_i} \right) = -9.4169 + 0.4984 \times Clump\_Thickness + ... + 0.4356 \times Mitosis\]

However, with some algebraic transformation, the logit function can be inverted to obtain the probability of the response event happening as a function of the predictors:

\[p_i = \frac{1}{1 + e^{-(-9.4169 + 0.4984 \times Clump\_Thickness + ... + 0.4356 \times Mitosis)}}\]

This equation represents the logistic regression model’s best-fit line.
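This inverted logit is exactly what predict(..., type = "response") computes. As a sketch, using the (rounded) Table 2 estimates for a hypothetical tumor scoring 1 on every attribute:

```r
# Manual fitted probability for a hypothetical all-1s observation,
# using the rounded coefficients from Table 2; predict(model, type = "response")
# performs this same computation on real observations
eta <- -9.4169 + 0.4984*1 + 0.0992*1 + 0.2809*1 + 0.2688*1 +
        0.0800*1 + 0.3446*1 + 0.4083*1 + 0.2196*1 + 0.4356*1
plogis(eta)  # 1 / (1 + exp(-eta)): a very small probability of malignancy
```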

Logistic regr.: multicollinearity

Let’s check for collinearity using the VIF function from the ‘car’ package.

  • A Variance Inflation Factor \(VIF > 5\) indicates that there could be correlation between predictor variables.

  • The VIF values are all less than 5, which indicates that there is no severe correlation between predictor variables in the model. ✅

car::vif(model)
  clump_thickness      uniform_size     uniform_shape marginal_adhesion 
         1.165502          2.775020          2.799658          1.258659 
single_epith_size       bare_nuclei   bland_chromatin     normal_nuclei 
         1.456077          1.172923          1.266337          1.209717 
          mitosis 
         1.060126 

Logistic regr.: model performance

Highlight key logistic model performance metrics using broom::glance() function:

  • AIC: A measure of model quality; lower values indicate a better fit with fewer parameters.
  • Null Deviance: A measure of model error; how well the response variable can be predicted by a model with only an intercept term.
  • Deviance: A measure of model error; how well the response variable can be predicted by a model with predictor variables.
    • lower values mean the model fits the data better!
broom::glance(model)[, c("AIC", "null.deviance","deviance")] %>% 
  # show only performance  metrics
  knitr::kable() 
Table 3: Key logistic model performance metrics
AIC null.deviance deviance
111.7472 710.4404 91.74724

Logistic regr.: improving the model

  • The model includes all variables, but we could make it more parsimonious by removing variables that are not significant!
  • We use a statistic called the Akaike Information Criterion (AIC) to compare models.
    • AIC’s calculation gives a penalty for including additional variables.
    • The model with the lowest AIC is considered the best.
# For example let's fit a model without the variable `uniform_size`
model2 = glm(class~ .-uniform_size, family = binomial, data=biopsy_train)

According to the AIC values, model2 seems better (its AIC is lower).

# Compare the AIC values of the 2 models
tibble(Model = c("model", "model2"), 
       AIC = c(AIC(model), AIC(model2) )) %>% 
  kable()
Model AIC
model 111.7472
model2 109.9322

Logistic regr.: systematic model selection

  • The MASS package’s function stepAIC enables us to perform a systematic model selection (by AIC):
    • The direction argument specifies the direction of the stepwise search.
    • The trace argument (if set to TRUE) prints out all the steps.
  • The best_model has removed these variables:
    • uniform_shape
    • single_epith_size
    • normal_nuclei
  • The best_model has the lowest AIC value (from 111.7 down to 109.9), despite a slightly higher Residual Deviance than the full model (from 91.7 up to 92.2).
# Select the best model based on AIC
best_model <- MASS::stepAIC(model, direction = "both", trace = FALSE)

# Compare the AIC values of full and best model
tibble(Model = c("model", "best_model"), 
       AIC = c(AIC(model), AIC(best_model)),
       Deviance = c(deviance(model), deviance(best_model))) %>% kable()

Logistic regr.: systematic model selection

Model AIC Deviance
model 111.7472 91.74724
best_model 109.9322 92.22712

Logistic regr.: predicting on test data

  • We can use the predict function to predict the class of the biopsy_test using the best_model.
    • biopsy_test_pred contains the probability that each observation (in the test data) is malignant.
  • The classification of the PredictedValue... can be done using different probability thresholds (0.5, 0.4, 0.3, etc.), which will affect the true and false positive rates:
    • PredictedValue_05 is the standard threshold of 0.5.
    • PredictedValue_04 is a more conservative threshold of 0.4.
    • PredictedValue_07 is a more aggressive threshold of 0.7.
# Fitted values for the 141 test-data observations, based on best_model
biopsy_test_pred <- predict(best_model, newdata = biopsy_test, type = "response")

# Actual classes in the test data
ActualValue <- biopsy_test$class

# Different possible thresholds for the predicted probabilities
PredictedValue_05 <- if_else(biopsy_test_pred > 0.5, "pred_malignant", "pred_benign")
PredictedValue_04 <- if_else(biopsy_test_pred > 0.4, "pred_malignant", "pred_benign")
PredictedValue_07 <- if_else(biopsy_test_pred > 0.7, "pred_malignant", "pred_benign")

Logistic regr. predictions: confusion matrix

In diagnosing malignant tumors, it is important to keep the false negative rate low as this would be telling someone who has a malignant tumor that it is benign.

  • At \(p = 0.7\), the model produces more false negatives than at \(p = 0.4\) (2 vs. 1 FN in the matrices below) – which we DON’T WANT
    • in this situation \(p = 0.4\) is preferable.

Then, we can evaluate the model’s performance on the test data by building a confusion matrix

# Build the confusion matrix with p = 0.4
table(ActualValue=biopsy_test$class, PredictedValue_04) %>% 
  knitr::kable()
           pred_benign  pred_malignant
benign              97               2
malignant            1              41
# Build the confusion matrix with p = 0.7
table(ActualValue=biopsy_test$class, PredictedValue_07) %>% 
  knitr::kable()
           pred_benign  pred_malignant
benign              98               1
malignant            2              40

Logistic regr. predictions: accuracy

  • We found that a cutoff of 0.4 gives a good balance: it keeps false negatives low while still keeping false positives rare.

  • With the chosen threshold of \(p = 0.4\), we can calculate the model’s accuracy on the test data.

# Build the confusion matrix with p = 0.4
conf_matr_04 <- table(ActualValue=biopsy_test$class, PredictedValue_04)  

# Calculate the accuracy
accuracy <- sum(diag(conf_matr_04)) / sum(conf_matr_04)
accuracy
[1] 0.9787234
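Accuracy alone hides the asymmetry we care about here. Sensitivity and specificity can be read off the same confusion matrix; the minimal sketch below hard-codes the counts from the \(p = 0.4\) table above rather than recomputing them.

```r
# Confusion-matrix counts at p = 0.4 (copied from the table above)
TN <- 97; FP <- 2   # actual benign:    predicted benign / malignant
FN <- 1;  TP <- 41  # actual malignant: predicted benign / malignant

sensitivity <- TP / (TP + FN)  # share of malignant tumors correctly flagged
specificity <- TN / (TN + FP)  # share of benign tumors correctly cleared

round(c(sensitivity = sensitivity, specificity = specificity), 4)
# sensitivity specificity 
#      0.9762      0.9798
```

The high sensitivity is exactly why the 0.4 cutoff was preferred: it minimizes missed malignancies.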

Logistic regr.: ROC curve

biopsy_test_pred

Logistic regr.: ROC curve

          4           5           8          11          16          20 
0.856141485 0.021801730 0.006638569 0.002740809 0.697791544 0.032716342 
         21          25          32          33          52          61 
0.997961706 0.002740809 0.004519927 0.998094980 0.381806237 0.890311717 
         67          69          70          89          90          91 
0.012240692 0.999981431 0.003471541 0.012240692 0.003929098 0.002740809 
        106         108         109         113         116         120 
0.941402362 0.992363246 0.002852352 0.999743776 0.004606417 0.010504861 
        128         134         139         142         149         155 
0.007445284 0.006121686 0.011678064 0.001901825 0.144563429 0.001152043 
        179         185         187         195         196         199 
0.012240692 0.988166650 0.989829799 0.007445284 0.012240692 0.001152043 
        201         208         212         225         226         228 
0.999880409 0.002740809 0.999653743 0.999855572 0.001777246 0.998393444 
        237         245         247         256         257         268 
0.999981069 0.002740809 0.999937031 0.933948013 0.003138053 0.859315512 
        269         270         272         280         286         306 
0.999916759 0.002740809 0.020062328 0.999271345 0.999995077 0.999301628 
        308         309         330         331         334         335 
0.002740809 0.998855592 0.998162104 0.993091237 0.971963147 0.991911245 
        341         344         348         354         361         366 
0.995690727 0.001152043 0.001849647 0.999461911 0.999999318 0.002932733 
        370         374         377         387         390         394 
0.003842243 0.017475517 0.001777246 0.988988418 0.023239762 0.001152043 
        400         405         415         416         418         427 
0.002492425 0.002622405 0.995754012 0.198484715 0.001777246 0.056025207 
        432         440         445         446         449         460 
0.044590613 0.008518553 0.054401619 0.001901825 0.001152043 0.018272953 
        471         472         476         485         487         497 
0.004835827 0.024892637 0.003138053 0.012487974 0.004835827 0.001152043 
        500         505         506         511         524         528 
0.007964004 0.001152043 0.003138053 0.001152043 0.997887796 0.012240692 
        535         542         544         546         549         553 
0.002932733 0.003138053 0.007964004 0.013089106 0.003138053 0.028155249 
        556         558         561         563         564         569 
0.091374216 0.015217068 0.020062328 0.002740809 0.004835827 0.946116385 
        577         583         585         587         590         596 
0.013089106 0.997256237 0.035930450 0.999994075 0.008518553 0.013089106 
        597         598         600         604         608         611 
0.011678064 0.042469039 0.026665424 0.987905231 0.001152043 0.999355762 
        614         622         624         630         633         635 
0.002932733 0.406899632 0.001152043 0.005173691 0.001152043 0.003138053 
        637         638         644         648         653         655 
0.999884288 0.061172200 0.001152043 0.002070039 0.016533076 0.007445284 
        659         668         672         673         677         678 
0.999488816 0.007445284 0.009959429 0.004519927 0.002613676 0.008518553 
        683         693         699 
0.025294048 0.003138053 0.990326447 
# Combine the predicted probabilities with the actual class labels
predicted.data <- data.frame(prob.of.malig = biopsy_test_pred, malig = biopsy_test$class)

# Sort observations from lowest to highest predicted probability
predicted.data <- predicted.data[order(predicted.data$prob.of.malig, decreasing = FALSE), ]

# Add a rank index for the x axis
predicted.data$rank <- 1:nrow(predicted.data)

plot_ROC <- ggplot(data = predicted.data, aes(x = rank, y = prob.of.malig)) +
  geom_point(aes(color = malig), alpha = 1, shape = 4, stroke = 2) +
  xlab("Index") + 
  ylab("Predicted Probability of Tumor Being Malignant")  

plot_ROC
This plot shows why lowering the cutoff improves the model's usefulness: some malignant tumors receive low predicted probabilities, and a high threshold would misclassify them as benign (false negatives).
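The plot above ranks observations by predicted probability; a true ROC curve instead plots the true positive rate against the false positive rate across thresholds. A base-R sketch follows (the helper name `roc_points` is ours; dedicated packages such as pROC do this more thoroughly):

```r
# Sweep classification thresholds and compute TPR/FPR at each one --
# these are the points of an ROC curve.
roc_points <- function(probs, actual, thresholds = seq(0, 1, by = 0.05)) {
  is_malig <- actual == "malignant"
  t(sapply(thresholds, function(th) {
    pred <- probs > th
    c(threshold = th,
      TPR = sum(pred & is_malig) / sum(is_malig),    # sensitivity
      FPR = sum(pred & !is_malig) / sum(!is_malig))  # 1 - specificity
  }))
}

# e.g., with the objects defined above:
# roc <- roc_points(biopsy_test_pred, biopsy_test$class)
# plot(roc[, "FPR"], roc[, "TPR"], type = "b", xlab = "FPR", ylab = "TPR")
```

Tracing the TPR/FPR pairs as the threshold moves makes the trade-off between the 0.4 and 0.7 cutoffs explicit.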

Logistic regr.: conclusions

  • Uniform Size and Single Epithelial Size were not significant in predicting the malignancy of tumor cells, so our model does not include these variables.

  • Our fitted model substantially reduces the null deviance, and the stepwise-selected model lowers the AIC without meaningfully increasing the residual deviance; it predicts the testing dataset with >90% accuracy.

  • For further analysis, we could refit the model on several different train/test splits, since our original and revised models performed similarly.

  • New training and testing data would help confirm our results and help identify possible overfitting.
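One way to get "new" training and testing data without collecting more samples is k-fold cross-validation: repeatedly refit on part of the data and score on the held-out part. A sketch, assuming a complete-case data frame with the outcome class (the helper name cv_accuracy and the 0.4 cutoff are our choices):

```r
# k-fold cross-validated accuracy for a logistic model class ~ .
cv_accuracy <- function(data, k = 5, cutoff = 0.4) {
  folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
  sapply(1:k, function(i) {
    fit  <- glm(class ~ ., data = data[folds != i, ], family = binomial)
    prob <- predict(fit, newdata = data[folds == i, ], type = "response")
    pred <- ifelse(prob > cutoff, "malignant", "benign")
    mean(pred == data[folds == i, ]$class)            # per-fold accuracy
  })
}

# e.g., on complete cases of biopsy without the ID column:
# set.seed(123)
# mean(cv_accuracy(na.omit(biopsy[, -1])))
```

A stable mean accuracy across folds would confirm the single train/test result; a large spread would suggest overfitting.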

🟠 K-MEANS CLUSTERING: EXAMPLE of UNSUPERVISED ML ALGORITHM

…]
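The slides for this section are omitted above; as a minimal sketch of the idea, k-means with k = 2 on the scaled numeric biopsy variables recovers groups that line up well with the benign/malignant labels (which the algorithm never sees):

```r
library(MASS)   # provides the biopsy dataset

# Numeric variables only, complete cases (same prep as for the PCA)
data_biopsy <- na.omit(biopsy[, -c(1, 11)])  # drop ID and class

set.seed(42)
km <- stats::kmeans(scale(data_biopsy), centers = 2, nstart = 25)

# Cross-tabulate the unsupervised clusters against the held-out labels
table(cluster = km$cluster, class = na.omit(biopsy)$class)
```

Because k-means is unsupervised, cluster numbers 1 and 2 are arbitrary; only the agreement pattern in the cross-tabulation matters.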

PCA: EXAMPLE of UNSUPERVISED ML ALGORITHM

Reducing high-dimensional data to a lower number of variables

biopsy dataset manipulation

We will:

  • exclude the non-numerical variables (ID and class) before conducting the PCA.

  • exclude the individuals with missing values, using the na.omit() or filter(complete.cases(.)) functions.

  • We can do this in two equivalent ways:


with base R (more compact)

# new (manipulated) dataset 
data_biopsy <- na.omit(biopsy[,-c(1,11)])

with dplyr (more explicit)

# new (manipulated) dataset 
data_biopsy <- biopsy %>% 
  # drop incomplete & non-integer columns
  dplyr::select(-ID, -class) %>% 
  # drop incomplete observations (rows)
  dplyr::filter(complete.cases(.))

biopsy dataset manipulation

We obtained a new dataset with 9 variables and 683 observations (instead of the original 699).

# check reduced dataset 
str(data_biopsy)
'data.frame':   683 obs. of  9 variables:
 $ V1: int  5 5 3 6 4 8 1 2 2 4 ...
 $ V2: int  1 4 1 8 1 10 1 1 1 2 ...
 $ V3: int  1 4 1 8 1 10 1 2 1 1 ...
 $ V4: int  1 5 1 1 3 8 1 1 1 1 ...
 $ V5: int  2 7 2 3 2 7 2 2 2 2 ...
 $ V6: int  1 10 2 4 1 10 10 1 1 1 ...
 $ V7: int  3 3 3 3 3 9 3 3 1 2 ...
 $ V8: int  1 2 1 7 1 7 1 1 1 1 ...
 $ V9: int  1 1 1 1 1 1 1 1 5 1 ...

Calculate Principal Components

The first step of PCA is to calculate the principal components. To accomplish this, we use the prcomp() function from the stats package.

  • With argument “scale = TRUE” each variable in the biopsy data is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components (just like option Autoscaling in MetaboAnalyst)
# calculate principal component
biopsy_pca <- prcomp(data_biopsy, 
                     # standardize variables
                     scale = TRUE)

Analyze Principal Components

Let’s check out the elements of our obtained biopsy_pca object

  • (All accessible via the $ operator)
names(biopsy_pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"       

“sdev” = the standard deviation of the principal components

“sdev”^2 = the variance of the principal components (eigenvalues of the covariance/correlation matrix)

“rotation” = the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).

“center” and “scale” = the means and standard deviations of the original variables before the transformation;

“x” = the principal component scores (after PCA the observations are expressed in principal component scores)
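These pieces fit together in one identity: the scores in "x" are exactly the centered and scaled data multiplied by the loading matrix. A quick self-contained check (repeating the data prep from above):

```r
library(MASS)
data_biopsy <- na.omit(biopsy[, -c(1, 11)])
biopsy_pca  <- prcomp(data_biopsy, scale = TRUE)

# scores = standardized data %*% loadings
scores_manual <- scale(data_biopsy,
                       center = biopsy_pca$center,
                       scale  = biopsy_pca$scale) %*% biopsy_pca$rotation

all.equal(unname(scores_manual), unname(biopsy_pca$x))
# [1] TRUE
```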

Analyze Principal Components (cont.)

We can see the summary of the analysis using the summary() function

  1. The first row gives the Standard deviation of each component, which can also be retrieved via biopsy_pca$sdev.
  2. The second row shows the Proportion of Variance, i.e. the percentage of explained variance.
summary(biopsy_pca)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
Cumulative Proportion  0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
                           PC8     PC9
Standard deviation     0.51062 0.29729
Proportion of Variance 0.02897 0.00982
Cumulative Proportion  0.99018 1.00000

Proportion of Variance for components

  1. The row with Proportion of Variance can be either accessed from summary or calculated as follows:
# a) Extracting Proportion of Variance from summary
summary(biopsy_pca)$importance[2,]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9 
0.65550 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982 
# b) (same thing)
round(biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2), digits = 5)
[1] 0.65550 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982


The output suggests the 1st principal component explains around 65% of the total variance, the 2nd principal component about 9%, and so on, with each subsequent component explaining a diminishing share.

Cumulative Proportion of variance for components

  1. The last row from the summary(biopsy_pca), shows the Cumulative Proportion of variance, which calculates the cumulative sum of the Proportion of Variance.
# Extracting Cumulative Proportion from summary
summary(biopsy_pca)$importance[3,]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9 
0.65550 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121 0.99018 1.00000 


Once you have computed the PCA in R, you must decide how many components to retain, based on the obtained results.
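Two common rules of thumb can turn these numbers into a decision (self-contained, repeating the data prep from above):

```r
library(MASS)
biopsy_pca <- prcomp(na.omit(biopsy[, -c(1, 11)]), scale = TRUE)

prop_var <- biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2)

# a) keep enough components to explain at least 80% of the variance
which(cumsum(prop_var) >= 0.80)[1]
# [1] 3

# b) Kaiser rule: keep components whose eigenvalue (sdev^2) exceeds 1
sum(biopsy_pca$sdev^2 > 1)
# [1] 1
```

The two rules disagree here (3 vs. 1 components), which is common; the scree plot in the next section offers a third, visual, criterion.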

VISUALIZING PCA OUTPUTS

Scree plot

There are several ways to decide on the number of components to retain.

  • One helpful option is visualizing the percentage of explained variance per principal component via a scree plot.
    • Plotting with the fviz_eig() function from the factoextra package
# Scree plot shows the variance of each principal component 
factoextra::fviz_eig(biopsy_pca, 
                     addlabels = TRUE, 
                     ylim = c(0, 70))


Visualization is essential in the interpretation of PCA results. Based on the number of retained principal components, which is usually the first few, the observations expressed in component scores can be plotted in several ways.

Scree plot

The obtained scree plot simply visualizes the output of summary(biopsy_pca).

Principal Component Scores

After a PCA, the observations are expressed as principal component scores.

  1. We can retrieve the principal component scores for each observation by calling biopsy_pca$x, and store them in a new dataframe PC_scores.
  2. Next we draw a scatterplot of the observations – expressed in terms of principal components
# Create new object with PC_scores
PC_scores <- as.data.frame(biopsy_pca$x)
head(PC_scores)

It is also important to visualize the observations along the new axes (principal components) to interpret the relations in the dataset:

Principal Component Scores

        PC1         PC2         PC3         PC4         PC5         PC6
1  1.469095 -0.10419679  0.56527102  0.03193593 -0.15088743 -0.05997679
2 -1.440990 -0.56972390 -0.23642767  0.47779958  1.64188188  0.48268150
3  1.591311 -0.07606412 -0.04882192  0.09232038 -0.05969539  0.27916615
4 -1.478728 -0.52806481  0.60260642 -1.40979365 -0.56032669 -0.06298211
5  1.343877 -0.09065261 -0.02997533  0.33803588 -0.10874960 -0.43105416
6 -5.010654 -1.53379305 -0.46067165 -0.29517264  0.39155544 -0.11527442
         PC7        PC8          PC9
1 -0.3491471 -0.4200360 -0.005687222
2  1.1150819 -0.3792992  0.023409926
3 -0.2325697 -0.2096465  0.013361828
4  0.2109599  1.6059184  0.182642900
5 -0.2596714 -0.4463277 -0.038791241
6 -0.3842529  0.1489917 -0.042953075

Principal Component Scores plot (adding label variable)

  1. When data includes a factor variable, like in our case, it may be interesting to show the grouping on the plot as well.
  • In such cases, the label variable class can be added to the PC set as follows.
# retrieve class variable
biopsy_no_na <- na.omit(biopsy)
# adding class grouping variable to PC_scores
PC_scores$Label <- biopsy_no_na$class


The visualization of the observation points (point cloud) could be in 2D or 3D.

Principal Component Scores plot (2D)

The Scores Plot can be visualized via the ggplot2 package.

  • grouping is indicated by the argument color = Label;
  • geom_point() is used for the point cloud.
ggplot(PC_scores, 
       aes(x = PC1, 
           y = PC2, 
           color = Label)) +
  geom_point() +
  scale_color_manual(values=c("#245048", "#CC0066")) +
  ggtitle("Figure 1: Scores Plot") +
  theme_bw()

Principal Component Scores plot (2D)

Figure 1 shows the observations projected into the new data space made up of principal components

Principal Component Scores (2D Ellipse Plot)

Confidence ellipses can also be added to a grouped scatter plot visualized after a PCA. We use the ggplot2 package.

  • grouping is indicated by the argument color = Label;
  • geom_point() is used for the point cloud;
  • the stat_ellipse() function is called to add the ellipses per biopsy group.
ggplot(PC_scores, 
       aes(x = PC1, 
           y = PC2, 
           color = Label)) +
  geom_point() +
  scale_color_manual(values=c("#245048", "#CC0066")) +
  stat_ellipse() + 
  ggtitle("Figure 2: Ellipse Plot") +
  theme_bw()

Principal Component Scores (2D Ellipse Plot)

Figure 2 shows the observations projected into the new data space made up of principal components, with 95% confidence regions displayed.

Principal Component Scores plot (3D)

A 3D scatterplot of observations shows the first 3 principal components’ scores.

  • For this one, we need the scatterplot3d() function of the scatterplot3d package;
  • The color argument is assigned to the Label variable;
  • To add a legend, we use the legend() function and specify its coordinates via the xyz.convert() function.
# 3D scatterplot ...
plot_3d <- with(PC_scores, 
                scatterplot3d::scatterplot3d(PC1, PC2, PC3, 
                                             color = as.numeric(Label), 
                                             pch = 19, 
                                             main = "Figure 3: 3D Scatter Plot", 
                                             xlab = "PC1",
                                             ylab = "PC2",
                                             zlab = "PC3"))

# ... + legend
legend(plot_3d$xyz.convert(0.5, 0.7, 0.5), 
       pch = 19, 
       yjust=-0.6,
       xjust=-0.9,
       legend = levels(PC_scores$Label), 
       col = seq_along(levels(PC_scores$Label)))

Principal Component Scores plot (3D)

Figure 3 shows the observations projected into the new 3D data space made up of principal components.

Biplot: principal components v. original variables

Next, we create another special type of scatterplot (a biplot) to understand the relationship between the principal components and the original variables.
In the biplot each of the observations is projected onto a scatterplot that uses the first and second principal components as the axes.

  • For this plot, we use the fviz_pca_biplot() function from the factoextra package
    • We will specify the color for the variables, or rather, for the “loading vectors”
    • The habillage argument highlights the grouping by class with color
factoextra::fviz_pca_biplot(biopsy_pca, 
                repel = TRUE,
                col.var = "black",
                habillage = biopsy_no_na$class,
                title = "Figure 4: Biplot", geom="point")

Biplot: principal components v. original variables

The axes show the principal component scores, and the vectors are the loading vectors

Interpreting biplot output

Biplots have two key elements: scores (the 2 axes) and loadings (the vectors). As in the scores plot, each point represents an observation projected in the space of principal components where:

  • Biopsies of the same class are located closer to each other, indicating that they have similar scores on the first 2 principal components;
  • The loading vectors show strength and direction of association of original variables with new PC variables.

As expected from PCA, PC1 alone captures variance from almost all of the original variables, while V9 has the largest projection along PC2.

Interpreting biplot output (cont.)

scores <- biopsy_pca$x

loadings <- biopsy_pca$rotation
# excerpt of first 2 components
loadings[ ,1:2] 
          PC1         PC2
V1 -0.3020626 -0.14080053
V2 -0.3807930 -0.04664031
V3 -0.3775825 -0.08242247
V4 -0.3327236 -0.05209438
V5 -0.3362340  0.16440439
V6 -0.3350675 -0.26126062
V7 -0.3457474 -0.22807676
V8 -0.3355914  0.03396582
V9 -0.2302064  0.90555729
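We can confirm that reading programmatically: the variable with the largest absolute loading on PC2 is indeed V9 (self-contained, repeating the prep from above):

```r
library(MASS)
biopsy_pca <- prcomp(na.omit(biopsy[, -c(1, 11)]), scale = TRUE)
loadings   <- biopsy_pca$rotation

# variable with the strongest projection onto PC2
rownames(loadings)[which.max(abs(loadings[, "PC2"]))]
# [1] "V9"
```

Taking the absolute value matters: the signs of loading vectors are arbitrary (prcomp may flip them), but their magnitudes are not.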

Recap of the workshop’s content

TOPICS WE COVERED

  1. Motivated the choice of learning/using R for scientific quantitative analysis, and laid out some fundamental concepts in biostatistics with concrete R coding examples.

  2. Consolidated understanding of inferential statistics, through R coding examples conducted on real biostatistics research data.

  3. Discussed the relationship between any two variables, and introduced a widely used analytical tool: regression.

  4. Presented a popular ML technique for dimensionality reduction (PCA), performed both with MetaboAnalyst and R.

  5. Introduced power analysis to define the correct sample size for hypothesis testing, and discussed how ML approaches deal with available data.

Final thoughts

  • While the workshop only allowed for a brief overview of fundamental ideas, it hopefully provided a solid foundation on the most common statistical analyses you will likely run in your daily work:

    • Thorough understanding of the input data and the data collection process
    • Univariate and bivariate exploratory analysis (accompanied by visual intuition) to form hypotheses
    • Upon verifying the assumptions, fitting the data to hypothesized model(s)
    • Assessment of the model performance (\(R^2\), \(Adj. R^2\), \(F\)-statistic, etc.)
  • You should now have a solid grasp on the R language to keep using and exploring the huge potential of this programming ecosystem

  • We only scratched the surface in terms of ML classification and prediction models, but we got the hang of the fundamental steps and some useful tools that may also serve us in more advanced analyses